An information-theoretic framework known as integrated information theory (IIT) has been
introduced recently for the study of the emergence of consciousness in the brain [D. Balduzzi and
G. Tononi, PLoS Comput. Biol. 4, e1000091 (2008)]. IIT purports that this phenomenon is to
be equated with the generation of information by the brain surpassing the information which the 
brain's constituents already generate independently of one another. IIT is not fully plausible in its 
modeling assumptions, nor is it testable due to severe combinatorial growth embedded in its key 
definitions. Here we introduce an alternative to IIT which, while inspired by similar information-theoretic
principles, seeks to address some of IIT's shortcomings. Our alternative
framework uses the same network-algorithmic cortical model we introduced earlier [A. Nathan and 
V. C. Barbosa, Phys. Rev. E 81, 021916 (2010)] and, to allow for somewhat improved testability 
relative to IIT, adopts the well-known notions of information gain and total correlation applied to 
a set of variables representing the reachability of neurons by messages in the model's dynamics. We 
argue that these two quantities relate to each other in a way that can be used to quantify the
system's efficiency in generating information beyond that which does not depend on integration, and 
give computational results on our cortical model and on variants thereof that are either structurally 
random in the sense of an Erdős-Rényi random directed graph or structurally deterministic. We
have found that our cortical model stands out with respect to the others in the sense that many 
of its instances are capable of integrating information more efficiently than most of those others' 
instances. 

PACS numbers: 87.18.Sn, 87.19.lj, 89.75.Fb 



I. INTRODUCTION 

Explaining the emergence of consciousness out of the 
massive neuronal interactions that take place in the brain 
is the greatest unsolved problem in neuroscience. It defies 
our ability to define what it means for the brain to be 
in a conscious state and, lacking a definition, also our 
ability to pinpoint the mechanisms that give rise to such 
a state and its evolution. The most recent player in the 
quest for a framework for consciousness studies is the 
integrated information theory (IIT). This theory is
information-theoretic in nature and seeks to characterize 
consciousness on the formal grounds of how information 
originating at different parts of the brain gets integrated 
as neurons interact with one another. IIT has been met 
with enthusiasm (cf., e.g., [0]), but substantial further 
developments are needed to help clarify whether this is 
justified. 

IIT is defined on a directed graph having a node for 
each of a group of variables. These variables' values 
evolve in lockstep (i.e., in discrete time, much as in a
cellular automaton [3]) in such a way that, at time t + 1,
the value of a particular variable is a function of its own 
value and of those of its in-neighbors in the graph at 
time t. Each node is thus assumed to have a local func- 
tion that it applies on inputs to get an output per time 
unit. Typically a variable's possible values are 0 or 1 and
a node's local function is one of the elementary logical 
operations. The basic tenet of IIT is that consciousness 
is to be equated with the surplus of information that the 
system is capable of generating, relative to the total
information that is generated by its parts independently of
one another, as it evolves from an initial state of maxi- 
mum uncertainty to a final state. The use of "parts" here 
refers to a specific partition of the set of variables. The 
information surplus that IIT considers is the minimum 
over all possible partitions. 

Unless a lot of regularity is present, computing this 
minimum information surplus requires a number of par- 
titions to be examined that is given in the worst case 
by the Bell number corresponding to the number of variables.
The Bell number for as few as 20 variables, say,
is already of the order of $10^{13}$, so the task at hand is
computationally intractable even for modestly sized sys- 
tems. Another potential obstacle to the success of IIT in 
eventually fulfilling the promise of helping characterize 
the emergence of consciousness is the apparent oversim- 
plistic character of some of its elements. In our view, 
these include the use of binary variables and operations 
(even if probabilistic), and also the assumption that the 
system evolves in time in a synchronous and memoryless 
fashion. 
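
To give a concrete sense of this combinatorial growth, the following Python sketch computes Bell numbers with the Bell triangle. The function name `bell_numbers` is purely illustrative and is not part of IIT or of any existing library.

```python
def bell_numbers(limit):
    """Return [B_0, B_1, ..., B_limit] computed with the Bell triangle."""
    bells = [1]   # B_0 = 1
    row = [1]     # first row of the Bell triangle
    for _ in range(limit):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry above-left to its left neighbor.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


if __name__ == "__main__":
    # Number of partitions of a set of 20 variables.
    print(bell_numbers(20)[-1])
```

For 20 variables the count is 51724158235372, that is, about $5 \times 10^{13}$ partitions to be examined in the worst case.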

Here we study the emergence of information integra- 
tion while striving both to adhere to the spirit of IIT and 
to address its potential shortcomings. The main elements 
of our approach are the following. 

(a) We adopt a cortical model with ample provisions 
for randomness, asynchrony, and neuronal memory. This 
model is the same we used previously in [5], having a
structural component and a functional one. The structural
component is a random directed graph and attempts
to portray, to the fullest possible extent, whatever
structural characteristics cortices can at present be
said to have. The model's functional component, in turn, 
prescribes a randomized distributed algorithm to run on 
the graph. This algorithm uses message passing to mimic 
inter-neuron signaling over synapses. It is also fully asyn- 
chronous, meaning that local actions are triggered by 
message arrivals independently of what is happening else- 
where in the graph. At each node, the algorithm is ca- 
pable of providing enough bookkeeping for some of the 
node's history to be influential on current actions. 

(b) We adopt two interrelated indicators of information 
integration. The first is simply information gain, that 
is, the amount of information the system generates as a 
single entity from an initial situation of maximum uncer- 
tainty. The second is total correlation, which in essence 
indicates by how much the first indicator surpasses the 
nodes' total gain of information when each node's gain 
is considered independently of all others'. Unlike IIT, 
the variables involved in our approach do not bear di- 
rectly on the fire/hold dichotomy of binary local states 
but on whether the corresponding nodes are reached by 
messages as the computation unfolds. 

The distributed algorithm in (a) requires at least one 
node to behave non-reactively at the beginning of a run. 
That is, at least one node must have the chance to send 
messages out spontaneously without any incoming mes- 
sage to trigger its actions. The algorithm admits any 
number of such initiators to act concurrently. Choos- 
ing the nodes to do it in each run is a random pro- 
cess. Together with the randomness already present in 
the graph's structure and in the algorithm, this random 
choice of initiators leads to an assessment of the indica- 
tors described in (b) that takes place by averaging them 
over a number of graphs and/or a number of runs on each 
graph, each run with a new set of initiators. Vis-à-vis the
treatment of information integration in IIT, which makes 
reference to an optimal partition of the set of variables, 
our approach is to generate a great number of message- 
flow patterns and to measure information integration as 
an expected, rather than optimal, quantity. 

We regard the present work as being fully in line with 
several others that have recently attempted to draw on 
graph-based methods to help solve problems in neuroscience
[6-15]. These works are all based on highly abstract
models of the underlying biological system, but
some researchers believe that a complete understanding 
of the system's properties can only come from consid- 
ering every possible detail, even down to the molecular 
level. So a sort of methodological chasm is beginning to 
appear, as documented in the news item found in [16]. As
in the present work's predecessor [5], here we adopt what
might be called the artificial-life stance [17], which essentially
posits the middle alternative of employing only as
much modeling detail as required to let some "life as it 
could be" properties emerge. Vague though this sounds, 
the cortical model we use has been shown to give rise 
to some such properties [5]. Specifically, by relying on
the combination of its two main components (one structural,
the other functional), our model gives rise, with
excellent agreement, to experimentally obtained lognormal
distributions of synaptic strengths [18]. Moreover,
by including enough detail of inter-neuron signaling so
that the all-important local histories [19] can always be
retrieved for careful examination, our model also reveals 
signs of the very rich dynamics that everyone agrees must 
underlie all cortical functions. 

We proceed according to the following layout. The
two components of our cortical model are reviewed in
Secs. II and III. Then we move, in Sec. IV, to a description
of the information-integration indicators to be used.
We give computational results in Sec. V and follow these
with discussion and conclusions, in Secs. VI and VII, respectively.



II. NETWORK MODEL 

The structural portion of our cortical model is the same
as in [5]. It consists of a directed graph D having n nodes,
one for each neuron. In D, an edge leading from node i to
node j indicates that a synapse exists between the axon of
the neuron that node i represents and one of the dendrites 
of the neuron represented by node j. The existence of 
such an edge, therefore, amounts to the possibility of 
direct causal influence of what happens at node i upon 
what happens at node j. In the same vein, indirect causal 
influence of what happens at node i upon what happens 
at farther nodes can also exist, provided those nodes can 
be reached from i through directed paths of D. If i is part
of any directed cycle in D, then it follows that present 
events at node i can causally influence future events at 
the same node also through the indirect mediation of all 
other nodes in the cycle. 

We regard D as originating from a random-graph
model, so completing its definition requires that we spec- 
ify how the out-degree of a randomly chosen node (its 
number of out-neighbors) is distributed, and also the 
probability that one of these out-neighbors is another 
randomly chosen node (this, indirectly, specifies the distribution
of a randomly chosen node's in-degree, its number
of in-neighbors). Still following [5], we assume that a
randomly chosen node has out-degree k > 0 with probability
proportional to $k^{-1.8}$. The adoption of a scale-free
law [20] with this particular exponent follows the work
in [21, 22], but one should mind the caveat given below.
If i is the randomly chosen node in question, what is
left to specify is the probability that each out-neighbor
of i is precisely another randomly chosen node, say j.
Taking inspiration from the work in [23, 24], first we assume
that the nodes of D are placed uniformly at random
on a radius-1 sphere. If $d_{ij}$ is the resulting Euclidean
distance between i and j, then each out-neighbor of i
coincides with j with probability proportional to $e^{\lambda d_{ij}}$,
with $\lambda < 0$ a constant. This constant heavily affects the size
of D's giant strongly connected component (GSCC) [25].
The GSCC of D is the largest subgraph of D
in which a directed path exists between any two nodes,
that is, the largest subgraph in which all nodes have the
potential of exerting direct or indirect causal influence
upon all others. Similarly to what is explained in [5], we
choose $\lambda = -1$ so that the expected number of nodes
in the GSCC is about 0.9n. Henceforth, we limit all
our information-integration investigations to within the
GSCC of D.
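
A minimal Python sketch of this construction is given next. It is an illustration under stated assumptions, not the code used in our experiments: the function names, the restriction of out-degrees to 1, ..., n-1, the requirement that a node's out-neighbors be distinct and different from the node itself, and the default power-law exponent are all choices made here for concreteness. The second function extracts the largest strongly connected component with Kosaraju's two-pass depth-first search.

```python
import math
import random


def sample_cortical_graph(n, exponent=1.8, lam=-1.0, rng=None):
    """Sample a directed graph as a list of out-neighbor sets (illustrative helper)."""
    rng = rng or random.Random()
    # Normalized Gaussian vectors are uniformly distributed on the unit sphere.
    pts = []
    for _ in range(n):
        vec = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in vec))
        pts.append([c / norm for c in vec])
    degrees = list(range(1, n))                       # assumed support: k = 1..n-1
    deg_weights = [k ** -exponent for k in degrees]   # P(k) proportional to k^(-exponent)
    out = []
    for i in range(n):
        k = rng.choices(degrees, weights=deg_weights)[0]
        others = [j for j in range(n) if j != i]
        weights = [math.exp(lam * math.dist(pts[i], pts[j])) for j in others]
        chosen = set()
        # Assumption: redraw until k distinct out-neighbors are obtained.
        while len(chosen) < k:
            chosen.add(rng.choices(others, weights=weights)[0])
        out.append(chosen)
    return out


def largest_scc(out):
    """Return the node set of the largest strongly connected component."""
    n = len(out)
    order, seen = [], [False] * n
    for start in range(n):                 # pass 1: record DFS finish order
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(out[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append((nxt, iter(out[nxt])))
                    break
            else:                          # all out-neighbors explored
                order.append(node)
                stack.pop()
    rev = [[] for _ in range(n)]           # reversed graph
    for i in range(n):
        for j in out[i]:
            rev[j].append(i)
    best, assigned = set(), [False] * n
    for start in reversed(order):          # pass 2: decreasing finish order
        if assigned[start]:
            continue
        assigned[start] = True
        comp, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for prev in rev[node]:
                if not assigned[prev]:
                    assigned[prev] = True
                    comp.add(prev)
                    stack.append(prev)
        if len(comp) > len(best):
            best = comp
    return best
```

Calling `largest_scc(sample_cortical_graph(1000))` returns the node set to which all subsequent measurements would be restricted; how closely its size approaches 0.9n depends on the assumptions listed above.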

The aforementioned caveat is the following. Even 
though the graph-theoretic modeling of cortices has made 
great progress [26] since the earliest attempts (as represented,
e.g., by [23], where the random graphs of Erdős
and Rényi [28] were used directly), our adoption of a
scale-free structure is far from any form of consensus. 
This is not to say that cortices have no scale-free proper- 
ties: in fact, it has been argued that they do indeed have 
such properties [l^ Isollsol , including the so-called small- 
world characteristics [3l|, the presence of hubs (nodes 
with a great number of out-neighbors) , and many others 
[33 |. What is meant, instead, is that there exist results 
pointing in contradictory directions. The recent growth 
model in for example, gives rise to an out-degree 
distribution that is not scale- free. If, on the other hand, 
we concede that this model is not fully justified biologi- 
cally and look instead for topological characteristics de- 
rived from measurements on real cortices, what we find 
is not at the level of detail that we need (i.e., the level 
of neuronal wiring). Our sources for the $k^{-1.8}$ power law
[21, 22], for example, adopt the granularity of functional
parts of the cortex. The latest available mapping [33] (in
fact the most comprehensive to date), in contrast, adopts 
structural (rather than functional) granularity and leads 
to the conclusion of an exponential (rather than a power- 
law) distribution. 

Another problem with these measurement-based char- 
acterizations is that all the reported distributions are in 
fact distributions of degrees, not out-degrees. That is, 
what counts for each node is its total number of neighbors
(in- and out-neighbors combined). While in [5] we
ignored this and adopted the power law of [21, 22] as
the only measurement-based distribution available at the
time, the more recent results in [33] lend, somewhat surprisingly,
new support to our choice of this power law.
In fact, we have found empirically that the degree distri- 
bution that results from our assumed out-degree distri- 
bution and distance-based deployment of directed edges 
can be approximated by an exponential over a significant 
range of degrees. This can be seen in Fig. 1, where the
distribution of the combined in- and out-degrees of the 
nodes of D is shown. In our view, this provides all the jus- 
tification we can have at this point regarding our choice of 
the random-graph model. Further justification (or, more 
likely, adaptation) will depend on wiring-level measure- 
ments, whose availability, to the best of our knowledge, 
is still not foreseeable. 

FIG. 1: Distribution of a node's number of neighbors (in- and
out-neighbors combined) in our cortical model for n = 100
(a) and n = 1000 (b). Probabilities are shown for the complementary
cumulative distribution; that is, given a number
k > 0 of neighbors, we show the probability that a randomly
chosen node has strictly more than k neighbors. Data are
averages over 1000 graphs for each value of n.



III. NETWORK ALGORITHMICS 

In order to fully describe the cortical model we use in 
the present study, the structural properties of graph D 
given in Sec. II need to be complemented with further,
functional properties of the graph's nodes and edges. A 
goal of the resulting model is to provide an algorithmic 
abstraction of cortical functioning that can mimic, to 
some extent, the buildup of potential at each neuron as its 
dendrites are reached by action potentials traveling down 
other neurons' axons, as well as the eventual firing that 
this buildup entails with the accompanying action poten- 
tial that travels down the neuron's own axon. Another 
goal is to simulate the dynamics of synaptic strengths as 
they vary in the wake of neuronal firing. 

We provide the necessary functional component of the 
model in the form of an asynchronous distributed algorithm
[34]. In general, such an algorithm assumes that
the nodes of D are capable of receiving messages from
their in-neighbors in the graph, of performing local com- 
putation on the information that is thus received, and 
finally of sending messages to their out-neighbors. This 
processing may occur concomitantly at several of the 
graph's nodes and edges and this is what gives the algo- 
rithm its distributed character. What gives it its asyn- 
chronous character, in turn, is the underlying assumption 
that the local computation at each node makes no refer- 
ence to temporal quantities of a global nature. Such in- 
herent locality gives asynchronous distributed algorithms 
a clear connection to most complex-network studies of 
the past decade, as evinced by the various contributions 
collected in [35]-[37]. Surprisingly, though, to the best of
our knowledge the interplay of structure and function, 
and its role in giving rise to global network properties, 
has seldom been explored. One notable example is the 
work in [38], where efficient information dissemination is
shown to emerge from strictly local decisions. Another 
example is provided by the present study and its predecessor
[5].

The algorithm we use is the one introduced in [5]. As
mentioned in Sec. I, here we assume that algorithm to be
sufficiently certified for use in the cortical model we study
because it is shown in [5] to be capable of giving rise,
among other things, to experimentally observed distributions
of synaptic strengths. This algorithm, henceforth
referred to simply as A, uses $v_j$ to represent the potential
of node j and $w_{ij}$ to represent the synaptic strength of
the edge directed from node i to node j. As customary,
we refer to each $w_{ij}$ as a synaptic weight. We also assume
that all nodes share the same rest and threshold
potentials, denoted by $v^0$ and $v^t$ (with $v^0 < v^t$), respectively,
and that every synaptic weight lies in the interval
[0, 1]. Finally, nodes can be excitatory or inhibitory and
D contains no edge connecting two inhibitory nodes [l^l . 

Our specification of algorithm A is based on the fol- 
lowing procedure, called Fire, which gives a probabilistic 
rule for node j to fire, that is, to send messages to its 
out-neighbors and reset its potential to the rest poten- 
tial. Each of these messages is to be interpreted as the 
signaling by j to one of its out-neighbors, through the 
corresponding synapse, that results from node j's firing 
and the ensuing action potential. Procedure Fire uses a 
probability parameter, p. 

Procedure Fire: 
Fire with probability p: 

1. Send a message to each out-neighbor of node j. 

2. Set $v_j$ to $v^0$.

Node j participates in a run of algorithm A through
a series of calls to procedure Fire with suitable p values.
In the first call in this series node j may be one of the so-called
initiators of the run. In this case its participation
is restricted to calling Fire with p = 1. All calls in which
j is not an initiator are reactive, in the sense that node
j only computes upon receiving a message from some in-neighbor.
In this case, the call to Fire is part of a larger
set of rules, as given next when node i is the in-neighbor
in question. In this larger set, the call to procedure Fire
is preceded by an alteration to the potential $v_j$ and followed,
possibly, by an alteration to the synaptic weight $w_{ij}$.
Algorithm A (reactive mode): 

1. If i is excitatory, then set $v_j$ to $\min\{v^t, v_j + w_{ij}\}$.

2. If i is inhibitory, then set $v_j$ to $\max\{v^0, v_j - w_{ij}\}$.

3. Call procedure Fire with $p = (v_j - v^0)/(v^t - v^0)$.

4. If firing did occur during the execution of Fire, then
set $w_{ij}$ to $\min\{1, w_{ij} + \delta\}$.

5. If firing did not occur during the execution of Fire
but the previous message received by node j from
any of its in-neighbors did cause j to fire, then set
$w_{ij}$ to $(1 - \alpha) w_{ij}$.

In steps 4 and 5 of algorithm A, $\delta > 0$ and $\alpha$ such that
$0 < \alpha < 1$ are parameters meant to let the algorithm
follow, though only to a limited extent, the principles of 
spike-timing-dependent plasticity [H, li^l . These princi- 
ples dictate, as a general rule, that the synaptic weight 
is to increase if firing occurs, decrease otherwise, always 
as a function of how close in time the relevant firings 
by nodes i and j are. Moreover, increases are to occur 
by a fixed amount, decreases by proportion [41]-[43]. As
explained in [5], $\delta$ and $\alpha$ must be such that $\delta < \alpha$.
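
Procedure Fire and the reactive rules can be sketched in Python as follows. This is only an illustrative sketch: the data layout (dictionaries `v`, `w`, `excitatory`, and `fired_last`, plus a list `outbox` of pending messages) and the function names are assumptions, as is the use of a per-node flag to remember whether the previous message received by a node caused it to fire. The constants are the parameter values reported in Sec. V.

```python
import random

V_REST, V_THRESH = -15.0, 0.0   # rest and threshold potentials (Sec. V values)
DELTA, ALPHA = 0.0002, 0.04     # plasticity parameters (Sec. V values)


def fire(j, v, out_neighbors, outbox, p, rng):
    """Procedure Fire: with probability p, message all out-neighbors and reset v[j]."""
    if rng.random() < p:
        for k in out_neighbors[j]:
            outbox.append((j, k))   # one message per out-neighbor of j
        v[j] = V_REST
        return True
    return False


def react(i, j, v, w, excitatory, fired_last, out_neighbors, outbox, rng):
    """Reactive steps 1-5 at node j upon receiving a message from in-neighbor i."""
    if excitatory[i]:
        v[j] = min(V_THRESH, v[j] + w[i, j])        # step 1
    else:
        v[j] = max(V_REST, v[j] - w[i, j])          # step 2
    p = (v[j] - V_REST) / (V_THRESH - V_REST)       # step 3
    fired = fire(j, v, out_neighbors, outbox, p, rng)
    if fired:
        w[i, j] = min(1.0, w[i, j] + DELTA)         # step 4
    elif fired_last[j]:
        w[i, j] = (1.0 - ALPHA) * w[i, j]           # step 5
    # Assumed bookkeeping: remember whether this message caused j to fire.
    fired_last[j] = fired
    return fired
```

An initiator would simply call `fire` with `p = 1`. A full run would then repeatedly remove a message `(i, j)` from `outbox` in some arbitrary order, reflecting asynchrony, and call `react(i, j, ...)` until `outbox` is empty.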

Clearly, any nontrivial run of algorithm A (i.e., one 
in which at least one message is sent) requires at least 
one node to behave as an initiator in the first of its calls 
to procedure Fire. Henceforth, we let m < n be the 
number of nodes that do this, that is, the number of 
initiators in the run. It should also be clear, by steps
1 and 2 of algorithm A, that setting the initial value of
$v_j$ to some number in the interval $[v^0, v^t]$ ensures that
$v_j$ remains in this interval perpetually (this guarantees
that the value to which probability p is set in step 3 is
always legitimate). Likewise, it follows from steps 4 and
5 that, if initialized to some number in the interval [0, 1],
weight $w_{ij}$ remains constrained to lie in this interval for
the whole run. 

Any run of algorithm A terminates eventually with 
probability 1. That is, there necessarily comes a time 
during the run at which no more messages are sent and, 
from then on, no further processing occurs at the nodes. 
When graph D is strongly connected (i.e., a directed path 
exists from any of its nodes to any other), then any firing
during the run causes messages to be sent. Similarly, 
any firing by a node that is not acting as initiator is 
preceded by accumulating and/or depleting alterations 
to the node's potential, as messages arrive, relative to the 
value it had initially or when the node last fired. Message 
traffic, therefore, provides the essential backdrop against 
which to conduct our study of information integration. 






IV. INFORMATION INTEGRATION 

We consider N discrete random variables, denoted by
$X_1, X_2, \ldots, X_N$, each taking values from the set {0, 1}.
All of our study on how information gets integrated in
a directed graph running the distributed algorithm A of
Sec. III is based on attaching meaning to these variables,
of which there is one for each of $N \le n$ nodes of graph
D, and to their distributions. We do this later in this
section and also in Sec. V. First, though, we establish
the two indicators of information integration that will be 
used. 

For the sake of notational conciseness, we use $\mathbf{X}$ to
denote the whole sequence $X_1, X_2, \ldots, X_N$ of variables,
and likewise $\mathbf{x} \in \{0,1\}^N$ to denote one of the possible
$2^N$ sequences of values $x_1, x_2, \ldots, x_N \in \{0,1\}$, each for
the corresponding variable. Unambiguously, then, $\mathbf{X} = \mathbf{x}$
means that $X_1 = x_1, X_2 = x_2, \ldots, X_N = x_N$. If $P(\mathbf{x})$
is the joint probability that $\mathbf{X} = \mathbf{x}$, then we use $P_i(x_i)$
to denote the marginal probability that $X_i = x_i$ for all
$i \in \{1, 2, \ldots, N\}$. Clearly, $P_i(x_i)$ is given by the sum of
$P(\mathbf{x})$ over all $2^{N-1}$ possibilities for $\mathbf{x}$ that leave the value
of $X_i$ fixed at $x_i$.

Ultimately, our indicators of information integration
are expressible in terms of the Shannon entropy associated
with the sequence $\mathbf{X}$ of variables given the joint
distribution P, or with each individual variable $X_i$ given
the corresponding marginal distribution $P_i$. This entropy
gives, in (information-theoretic) bits, a measure of how
much unpredictability the distribution embodies regarding
the values of the variables. We denote the joint entropy
by $H(\mathbf{X})$ and each marginal entropy by $H_i(X_i)$.
They are given by the well-known formulae

$$H(\mathbf{X}) = -\sum_{\mathbf{x} \in \{0,1\}^N} P(\mathbf{x}) \log_2 P(\mathbf{x}) \qquad (1)$$

and

$$H_i(X_i) = -\sum_{x_i \in \{0,1\}} P_i(x_i) \log_2 P_i(x_i). \qquad (2)$$

Recall that entropy is a function of the distribution and
is maximized when the distribution is uniform over its
domain. Thus, $0 \le H(\mathbf{X}) \le N$ and $0 \le H_i(X_i) \le 1$.
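
As an illustration of Eqs. (1) and (2), the Python sketch below computes the joint entropy and the marginal entropies from a joint distribution stored as a dictionary. The representation (tuples of 0s and 1s as keys, probabilities as values) and the function names are assumptions made for this example only.

```python
import math


def joint_entropy(P):
    """H(X) in bits; P maps tuples in {0,1}^N to probabilities summing to 1."""
    return -sum(p * math.log2(p) for p in P.values() if p > 0)


def marginals(P):
    """Return the list of marginal probabilities P_i(1), one per variable."""
    N = len(next(iter(P)))
    ones = [0.0] * N
    for x, p in P.items():
        for i, xi in enumerate(x):
            if xi == 1:
                ones[i] += p
    return ones


def marginal_entropy(q):
    """Binary entropy H_i(X_i) in bits, where q = P_i(1)."""
    return -sum(t * math.log2(t) for t in (q, 1.0 - q) if t > 0)
```

For the uniform distribution over two variables, `joint_entropy` attains its maximum of N = 2 bits and each marginal entropy attains its maximum of 1 bit; a distribution concentrated on a single point yields zero for all of them.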

The way our indicators become expressed as combinations
of entropies is through another fundamental
information-theoretic notion, that of the relative entropy,
or Kullback-Leibler (KL) divergence, of two distributions
[44]. Given two joint distributions P and Q over the same
set of N variables as above, the KL divergence of P relative
to Q, here denoted by D(P, Q), is given by

$$D(P, Q) = \sum_{\mathbf{x} \in \{0,1\}^N} P(\mathbf{x}) \log_2 \frac{P(\mathbf{x})}{Q(\mathbf{x})}, \qquad (3)$$

provided $Q(\mathbf{x}) > 0$ whenever $P(\mathbf{x}) > 0$. We have
D(P, Q) = 0 if and only if P and Q are the same distribution.
Otherwise D(P, Q) > 0, so the KL divergence
functions as a measure of how different the two distributions
are [though, in general, $D(P, Q) \ne D(Q, P)$].
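
The following Python sketch evaluates Eq. (3) for two distributions stored as dictionaries over the same set of points. The function name and the dictionary representation are assumptions of this example; the check on Q mirrors the proviso stated above.

```python
import math


def kl_divergence(P, Q):
    """D(P, Q) in bits for distributions given as dicts over the same points."""
    total = 0.0
    for x, p in P.items():
        if p > 0:
            q = Q.get(x, 0.0)
            if q <= 0:
                # Eq. (3) requires Q(x) > 0 whenever P(x) > 0.
                raise ValueError("Q(x) must be positive wherever P(x) is")
            total += p * math.log2(p / q)
    return total


if __name__ == "__main__":
    P = {(0,): 0.5, (1,): 0.5}
    Q = {(0,): 0.9, (1,): 0.1}
    # The two orderings generally give different values.
    print(kl_divergence(P, Q), kl_divergence(Q, P))
```

Terms with $P(\mathbf{x}) = 0$ are skipped, following the usual convention that they contribute nothing to the sum.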



A. Information gain 

The first of our two indicators, information gain, is the
KL divergence of P relative to Q when the latter reflects
a state of maximum unpredictability regarding the values
of the N variables. That is, we use $Q(\mathbf{x}) = 1/2^N$ for all
$\mathbf{x} \in \{0,1\}^N$. We denote information gain by $G(\mathbf{X})$ and
it follows from Eq. (3) that

$$G(\mathbf{X}) = N - H(\mathbf{X}). \qquad (4)$$

Evidently, $0 \le G(\mathbf{X}) \le N$.

A marginal version of information gain for $X_i$ can also
be defined by recognizing that $Q_i(0) = Q_i(1) = 0.5$. Denoting
this marginal information gain by $G_i(X_i)$, we have
$G_i(X_i) = D(P_i, Q_i)$, whence

$$G_i(X_i) = 1 - H_i(X_i) \qquad (5)$$

and $0 \le G_i(X_i) \le 1$.

Our use of information gain will be based on letting N
be the number of nodes in the graph's GSCC, that is, one
variable per node in the GSCC of graph D. Moreover,
$P_i(1)$ will be the probability that node i receives at least
one message during a run of algorithm A on D. Similarly,
$P(\mathbf{x})$ will be the probability that every node i for which
$x_i = 1$ (and no other node) receives at least one message
during the run.

B. Total correlation 

Our second indicator uses $Q(\mathbf{x}) = \prod_{i=1}^{N} P_i(x_i)$ for
$\mathbf{x} \in \{0,1\}^N$. That is, it addresses the question of how far
the variables $X_1, X_2, \ldots, X_N$ are from being independent
from one another relative to P. Given this choice for the
joint distribution Q, the KL divergence D(P, Q) becomes
what is known as the total correlation among the N variables
[45], henceforth denoted by $C(\mathbf{X})$. It follows from
Eq. (3) that

$$C(\mathbf{X}) = \sum_{i=1}^{N} H_i(X_i) - H(\mathbf{X}) \qquad (6)$$

[44]. Like entropy, total correlation is expressed in bits
and is a function of the joint distribution P. It is maximized
whenever P assigns zero probability to all but two
of the members of $\{0,1\}^N$: if $\mathbf{x}$ and $\mathbf{y}$ are the two exceptions,
then maximization occurs if $\mathbf{x}$ and $\mathbf{y}$ are complementary
value assignments to the variables (that is,
for all i it holds that $x_i = 0$ if and only if $y_i = 1$) and
moreover $P(\mathbf{x}) = P(\mathbf{y}) = 0.5$. Under these conditions,
clearly $H_i(X_i) = 1$ for all i and $H(\mathbf{X}) = 1$. Therefore,
$0 \le C(\mathbf{X}) \le N - 1$.
By Eqs. (4)-(6), we have $C(\mathbf{X}) = G(\mathbf{X}) - \sum_{i=1}^{N} G_i(X_i)$.
That is, total correlation is the amount of information
gain that surpasses the total gain provided by
the variables separately. Equivalently, information gain
comprises total correlation and the total marginal gain
$\sum_{i=1}^{N} G_i(X_i)$, i.e.,

$$G(\mathbf{X}) = C(\mathbf{X}) + \sum_{i=1}^{N} G_i(X_i). \qquad (7)$$



Our use of total correlation will also be based on letting
N be the number of nodes in the GSCC of graph
D. Moreover, both $P_i(1)$ and $P(\mathbf{x})$ will have the same
meanings as given above for information gain. We will
also use the ratio

$$r(\mathbf{X}) = \frac{C(\mathbf{X})}{G(\mathbf{X})} \qquad (8)$$

as an indicator of how conducive graph D is, under algorithm
A, to generating information in the form of total
correlation.
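
The three quantities just defined can be computed together, as in the Python sketch below. The function name, the dictionary representation of P, and the convention of returning NaN for the ratio when the gain is zero are assumptions of this example.

```python
import math


def integration_indicators(P):
    """Return (G, C, r) in bits for a joint distribution P over {0,1}^N.

    P maps tuples of 0s and 1s to probabilities summing to 1.
    """
    N = len(next(iter(P)))
    H = -sum(p * math.log2(p) for p in P.values() if p > 0)   # joint entropy
    sum_Hi = 0.0
    for i in range(N):
        q = sum(p for x, p in P.items() if x[i] == 1)          # P_i(1)
        sum_Hi -= sum(t * math.log2(t) for t in (q, 1.0 - q) if t > 0)
    G = N - H                          # information gain
    C = sum_Hi - H                     # total correlation
    r = C / G if G > 0 else math.nan   # ratio; left undefined when G = 0
    return G, C, r


if __name__ == "__main__":
    # Two complementary assignments with probability 0.5 each maximize C.
    print(integration_indicators({(0, 1, 0): 0.5, (1, 0, 1): 0.5}))
```

In the example, N = 3 and the output is G = 2, C = N - 1 = 2, and r = 1, in agreement with the bound on total correlation; for a product distribution C vanishes and so does r whenever G > 0.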



C. Expected values 

Running the distributed algorithm A of Sec. III on
graph D from a set of initiators alters the edges' synaptic
weights and, along with them, the joint distribution P.
As P changes, so do $G(\mathbf{X})$ and $C(\mathbf{X})$ and, in interpreting
the results of Sec. V, it will be useful to have $G(\mathbf{X})$ and
$C(\mathbf{X})$ values against which to gauge the values that we
obtain. Using the maximum values given above for each
quantity is of little meaning, since they occur only at
finitely many possibilities for P while P itself varies over
a continuum of possibilities.

We then look at the expected value of either quantity
as P varies. To do so, we first note that specifying P
is equivalent to specifying $2^N$ numbers in the interval
[0, 1], provided they add up to 1. In other words, P can
be identified with each and every point of the standard
simplex in $2^N$-dimensional real space. Calculating the
expected value of either $G(\mathbf{X})$ or $C(\mathbf{X})$ over this simplex
requires the choice of a density function and then an integration
over the simplex. Given the complexity of both
the distributed algorithm and the structure of D, it seems
unlikely that a suitable density function can be derived.
Moreover, even if we assume the uniform density instead,
there is still the task of integrating $G(\mathbf{X})$ and $C(\mathbf{X})$ over
the simplex, which to the best of our knowledge can be
done analytically for $G(\mathbf{X})$, through the expected value of
$H(\mathbf{X})$ (cf. [46] and references therein), but not for $C(\mathbf{X})$.

For sufficiently large N, it follows from the formula
in [46] that the expected value of $H(\mathbf{X})$ over the simplex
using the uniform density tends to $N - (1 - \gamma)/\ln 2$, where
$\gamma \approx 0.57722$ is the Euler constant [47]. Therefore, by
Eq. (4), the expected value of $G(\mathbf{X})$ tends to the constant
$(1 - \gamma)/\ln 2 \approx 0.6$. Similarly, it follows from Eq. (7) that
0.6 can also be taken as an approximate upper bound on
the expected value of $C(\mathbf{X})$. We also know from [46] that,
under these same conditions, $H(\mathbf{X})$ is tightly clustered
about the mean, and thus so is $G(\mathbf{X})$.
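
The limiting constant can be checked numerically. The Python sketch below evaluates $(1 - \gamma)/\ln 2$ and compares it with a Monte Carlo estimate of the expected information gain for small N. It is an illustration only: the function names are hypothetical, and the sampler relies on the standard fact that normalized independent unit-rate exponential variates are uniformly distributed on the simplex.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def limiting_expected_gain():
    """Large-N limit (1 - gamma) / ln 2 of the expected information gain, in bits."""
    return (1.0 - EULER_GAMMA) / math.log(2.0)


def sampled_expected_gain(N, samples, rng):
    """Monte Carlo estimate of E[G(X)] under the uniform density on the simplex."""
    K = 2 ** N
    total = 0.0
    for _ in range(samples):
        # Normalizing i.i.d. exponentials gives a uniform point of the simplex.
        e = [rng.expovariate(1.0) for _ in range(K)]
        s = sum(e)
        H = -sum((t / s) * math.log2(t / s) for t in e if t > 0)
        total += N - H
    return total / samples


if __name__ == "__main__":
    print(limiting_expected_gain())
    print(sampled_expected_gain(8, 200, random.Random(0)))
```

Already for N = 8 the sampled average should lie close to the limiting constant, consistently with the tight clustering of the entropy about its mean.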



V. COMPUTATIONAL RESULTS 

The methodology we follow in our computational experiments
is entirely analogous to the one introduced in
[5]. The central entity in this methodology is a run of algorithm
A of Sec. III using $v^0 = -15$, $v^t = 0$, $\delta = 0.0002$,
and $\alpha = 0.04$ at all times. A run is started by m = 50 initiators
chosen uniformly at random and progresses until
termination. These values of $\delta$ and $\alpha$ are the same that in
[5] were shown to allow the synaptic weights to become
distributed as observed experimentally. As for $v^0$, $v^t$,
and m, their values only regulate the traffic of causally
disconnected messages in the graph and therefore only influence
how early in a sequence of runs global properties
can be expected to emerge.

For a fixed graph D, first we decide for each of the 
N nodes in the graph's GSCC whether it is to be excitatory
or inhibitory. This is done uniformly at random,
provided no two inhibitory nodes are directly connected 
to each other. We use the widely accepted proportion of 
20% for the number of inhibitory nodes in D [23, H^j . All 
runs on graph D operate on this fixed set of inhibitory 
nodes. Then we choose initial node potentials and synaptic
weights uniformly at random from the intervals $[v^0, v^t]$
and [0,1], respectively. We group all runs on graph D 
into sequences. The first run in a sequence starts from 
the initial node potentials and synaptic weights that were 
chosen for the graph. Each subsequent run starts from 
the node potentials and synaptic weights left by the pre- 
vious run. We use 50 000 sequences for each graph D, 
each sequence comprising 10 000 runs. We adopt eleven 
observational checkpoints along the course of each se- 
quence. The first one occurs right at the beginning of 
the sequence, before any run takes place, so node poten- 
tials and synaptic weights are still the ones chosen ran- 
domly. The remaining ten checkpoints occur each after 
1 000 additional runs in the sequence. 

The purpose of each checkpoint is to allow the joint
distribution P of the variables $X_1, X_2, \ldots, X_N$ to be estimated
and, based on it, the calculation of information
gain $G(\mathbf{X})$ and total correlation $C(\mathbf{X})$. Since the
marginal $P_i(1)$ is to reflect the probability that node i
receives at least one message during a run, what is done
at each checkpoint is to observe the message propagation 
patterns that take place on graph D as algorithm A is 
executed on it. We do this by resorting to 100 side runs 
of the algorithm, each beginning with the choice of a new 
set of m initiators uniformly at random and starting from 
the node-potential and synaptic-weight values that are 
current at the checkpoint. At the end of all side runs, 
the main sequence of runs is resumed from these same 
values. For c = 1, 2, . . . , 11, the joint distribution P cor- 
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responding to the cth checkpoint can then be estimated 
from the overall number of side runs, which is 5 x 10^. For 
the purpose of averaging the resulting G(X) and C(X) 
values, multiple instances of graph D are needed. This is 
so in order to account for structural variations and varia- 
tions in the excitatory /inhibitory character of each node 
(in case the graphs come from sampling from a random- 
graph model), and for variations in the initial node po- 
tentials and synaptic weights (in all cases, including the 
single case in which the structure of D is deterministic; 
cf. below). 
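
The bookkeeping behind these numbers can be checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are ours, and the only values used are the ones quoted in the text (50 000 sequences, 10 000 runs per sequence, a checkpoint every 1 000 runs, 100 side runs per checkpoint).

```python
# Schedule constants quoted in the text; the names are illustrative only.
SEQUENCES_PER_GRAPH = 50_000
RUNS_PER_SEQUENCE = 10_000
RUNS_BETWEEN_CHECKPOINTS = 1_000
SIDE_RUNS_PER_CHECKPOINT = 100


def checkpoint_positions(runs_per_sequence, spacing):
    """Return the number of completed main runs at which each checkpoint occurs."""
    # The first checkpoint precedes every run, hence the position 0.
    return list(range(0, runs_per_sequence + 1, spacing))


def side_runs_per_checkpoint(sequences, side_runs):
    """Total side runs that feed the estimate of P at one checkpoint of one graph."""
    return sequences * side_runs


positions = checkpoint_positions(RUNS_PER_SEQUENCE, RUNS_BETWEEN_CHECKPOINTS)
total = side_runs_per_checkpoint(SEQUENCES_PER_GRAPH, SIDE_RUNS_PER_CHECKPOINT)
print(len(positions), positions[0], positions[-1], total)
```

The schedule yields eleven checkpoints, the first before any run and the last after run 10 000, and 5 x 10^6 side runs per checkpoint per graph.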

Estimating P at the cth checkpoint proceeds as follows.
After each side run of A, the point x in {0, 1}^N
such that x_i = 1 if and only if node i received at least one
message during the run has its number of occurrences
increased by 1. Straightforward normalization yields P(x)
after all sequences have reached that checkpoint on the
graph in question and the corresponding side runs have
terminated. This poses a somewhat severe storage problem,
since for each D we execute the sequences one after
the other while handling the graphs in parallel (on different
processors). Therefore, the accumulators corresponding
to the various members of {0, 1}^N that are actually
observed have to be stored concomitantly for all eleven
checkpoints. There is no choice but to use external (i.e.,
disk-based) storage in this case, which greatly increases
the time needed to complete the computation.
So the number 5 x 10^6 of side runs per checkpoint per
graph cannot in practice be made substantially larger.
This number, after being multiplied by the number of graphs
in use, is also an upper bound on how many members of
{0, 1}^N can be observed per checkpoint, so not being able
to increase it means that the number of nodes n cannot
be too large, either. All the results we give henceforth
are then for n = 100.
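
The counting scheme just described can be sketched as follows. This is a minimal in-memory illustration, not the disk-based implementation mentioned above; the function names and the three-node toy data are hypothetical.

```python
from collections import Counter


def record_side_run(counts, reached):
    """Increment the accumulator of the point x observed in one side run.

    `reached` holds one 0/1 entry per GSCC node: 1 if the node received
    at least one message during the side run, 0 otherwise.
    """
    counts[tuple(reached)] += 1


def normalize(counts):
    """Turn occurrence counts into the empirical joint distribution P."""
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


# Toy usage with N = 3 and four hypothetical side runs.
counts = Counter()
for outcome in [(1, 0, 1), (1, 0, 1), (0, 0, 1), (1, 1, 1)]:
    record_side_run(counts, outcome)
P = normalize(counts)
```

Only the points that are actually observed get an accumulator, which is why the number of side runs bounds the number of members of {0, 1}^N that can appear in the estimate.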

We consider three different types of graph. They are 
referred to as type-(i)-(iii) graphs, as follows: 

(i) First is the random-graph model introduced in
Sec. II as the structural component of our cortical model.
Generating D from this random-graph model starts with
placing the n nodes on the surface of a radius-1 sphere
uniformly at random and then selecting, for each node, its
out-degree and its out-neighbors. The choice of A = -1
explained in Sec. is specific of the n = 100 case and 
yields N ≈ 90. Also, for this value of n the expected in-
or out-degree in D is about 3.7. Out-degrees are by
construction distributed as a power law, whereas the
distribution of in-degrees has been found to be similar to the
Poisson distribution (i.e., concentrated near the mean).

(ii) Another random-graph model that we use is the
generalization of the Erdos-Renyi model to the directed
case [48]. Given the desired expected in- or out-degree,
denoted by z, generating D places a directed edge from
node i to node j ≠ i with probability z/(n - 1). The
resulting in- and out-degree distributions approach the
Poisson distribution of mean z. If z > 1, the graph's
GSCC encompasses nearly all the graph with high
probability; that is, N ≈ 100. For consistency with our cortical
model, we use z = 3.7.
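
A type-(ii) graph can be generated along the lines below. This is a sketch under stated assumptions: the function name and the seeded generator are ours, and the extraction of the GSCC, which the text requires before any run, is not shown.

```python
import random


def directed_erdos_renyi(n, z, rng):
    """Return the edge list of a directed Erdos-Renyi graph.

    Each ordered pair (i, j) with j != i becomes an edge independently
    with probability z / (n - 1), so the expected in- and out-degree is z.
    """
    p = z / (n - 1)
    return [(i, j)
            for i in range(n)
            for j in range(n)
            if i != j and rng.random() < p]


# Illustrative usage with the values quoted in the text.
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
edges = directed_erdos_renyi(100, 3.7, rng)
mean_out_degree = len(edges) / 100
```

For n = 100 and z = 3.7 the expected number of edges is 370, so `mean_out_degree` should come out close to 3.7.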

(iii) At the other extreme from our cortical model are
the graphs whose structure is deterministic. We use what
seems to be the simplest possible structure that ensures a
strongly connected D with a fixed in- or out-degree equal
to ⌈3.7⌉ = 4 for every node, the directed circulant graph
[49] generated by the integers in the interval [1,4]. If we
assume that the nodes are numbered 0 through n - 1,
then node i has four out-neighbors, nodes i + 1 through
i + 4, where addition is modulo n. For n = 100, the 20
inhibitory nodes are necessarily equally spaced around
the directed cycle that traverses the nodes in the order
0, 1, ..., n - 1, 0, lest there be a connection between two
inhibitory nodes. We have N = 100.
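
The type-(iii) construction and the spacing constraint on inhibitory nodes can be verified directly. The sketch below uses helper names of our own choosing; the offset of the first inhibitory node is an arbitrary assumption.

```python
def circulant_edges(n, jumps=(1, 2, 3, 4)):
    """Edges of the directed circulant graph: node i points to i + k (mod n)."""
    return [(i, (i + k) % n) for i in range(n) for k in jumps]


def equally_spaced(n, count, offset=0):
    """Return `count` nodes equally spaced around the cycle 0, 1, ..., n - 1, 0."""
    step = n // count
    return {(offset + step * k) % n for k in range(count)}


def has_inhibitory_pair(edges, inhibitory):
    """True if some edge joins two inhibitory nodes."""
    return any(u in inhibitory and v in inhibitory for u, v in edges)


edges = circulant_edges(100)
inhibitory = equally_spaced(100, 20)     # nodes 0, 5, 10, ..., 95
shifted = (inhibitory - {5}) | {6}       # move one inhibitory node by one position
print(has_inhibitory_pair(edges, inhibitory), has_inhibitory_pair(edges, shifted))
```

With a spacing of 5 no inhibitory node lies within the four out-neighbors of another, whereas moving a single node off the regular pattern (here from 5 to 6) creates the edge 6 -> 10 between two inhibitory nodes.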

All our results are given for 50 graphs of each of
types (i)-(iii) and appear in Figs. 2-4, respectively. The
(a) panels in these figures give the probability distributions
for the number of occurrences of those members
of {0, 1}^N that do appear in at least one side run on at
least one of the 50 graphs for each graph type at the
eleventh checkpoint. After averaging over the appropriate
50 graphs for each type, these members number 1 733
for type-(i) graphs, 4 756 for type-(ii) graphs, and 1 033
for type-(iii) graphs. These illustrate the point, raised
above, that the need to limit the total number of side
runs per graph does indeed have an impact on how capable
our methodology is of probing inside the set of all
2^N value assignments to the N variables. In fact, the
absolute majority of assignments are never encountered.
As for the others, the probability of encountering them
an increasing number of times decays as a power law [less
so for type-(iii) graphs].

The (b) and (c) panels in the three figures are used to
show the average information gain G(X) and total correlation
C(X), respectively, over the 50 graphs of the corresponding
graph type at each of the eleven checkpoints.
Error bars are omitted from the (b) and (c) panels of
Fig. 4 because the corresponding standard deviations are
negligible. Because neither G(X) nor C(X) can surpass
the number N of nodes in the graph's GSCC, and considering
that not all graphs across the three graph types
have the same GSCC size, all the data plotted in the
(b) and (c) panels of Figs. 2-4 are normalized to this
size. The latter, in turn, can be taken as 0.9n for type-(i)
graphs and n for type-(ii) and type-(iii) graphs. Our
normalization procedure, therefore, has been to divide all
G(X) and C(X) values for type-(i) graphs by 0.9, leaving
them unchanged for graphs of the other two types.

A different perspective on the results shown in Figs. 2-4
is given in Fig. 5, which presents a scatter plot of all
150 graphs of the three types, each represented by its
information gain and its total correlation at the last
checkpoint. In this figure, the same normalization described
above has also been used.

FIG. 2: (Color online) Results for type-(i) graphs: (a) the probability that a randomly chosen member of {0, 1}^N appearing in
the side runs of the last checkpoint for some graph occurs a certain number of times; (b) the average value of G(X) at each of
the checkpoints; (c) the average value of C(X) at each of the checkpoints.

FIG. 3: (Color online) Results for type-(ii) graphs: (a) the probability that a randomly chosen member of {0, 1}^N appearing
in the side runs of the last checkpoint for some graph occurs a certain number of times; (b) the average value of G(X) at each
of the checkpoints; (c) the average value of C(X) at each of the checkpoints.



VI. DISCUSSION 



The (b) panels in Figs. 2-4 demonstrate that the G(X)
averages, after a sharp decrease from the first checkpoint
to the second, keep on decreasing steadily along the runs
until stability is eventually reached. A similar trend is
seen in the (c) panels with regard to the C(X) averages,
now with increases. (We note that reaching stability in
either case is on a par with what, in [5], we showed to
happen with the distribution of synaptic weights under
the same cortical model. That is, despite the continual
modification of the weights as the dynamics goes on, their
distribution reaches a steady state.) With the exception
of Fig. 4, standard deviations can be significant all along
the runs, particularly with regard to total correlation.



Given this variability, the plot in Fig. 5 is important
and also quite revealing. First of all, it helps corroborate
what the (b) and (c) panels of Figs. 2-4 already say
about algorithm A, which essentially is what drives the
system toward the eventual joint distribution P over the
variables X_1, X_2, ..., X_N that is used to compute G(X)
and C(X) for each graph. As we discussed in Sec. IV C,
should all possibilities for P be equally likely, G(X) would
have a mean value of about 0.6 over all these possibilities
and would moreover be tightly clustered about this mean.
The G(X) values appearing in Fig. 5 demonstrate that
algorithm A, independently of which graph type is used,
completely subverts the uniformity hypothesis for P and
leads the system to generate information in amounts that
surpass the 0.6 mark very significantly. This holds also
with regard to the C(X) values in Fig. 5, whose mean
under uniformly weighted P's is also bounded from above
by 0.6.

FIG. 4: (Color online) Results for type-(iii) graphs: (a) the probability that a randomly chosen member of {0, 1}^N appearing
in the side runs of the last checkpoint for some graph occurs a certain number of times; (b) the average value of G(X) at each
of the checkpoints; (c) the average value of C(X) at each of the checkpoints.

FIG. 5: (Color online) A scatter plot of the 150 different
graphs used, 50 for each of types (i)-(iii). Each graph is
represented by its information gain G(X) and its total correlation
C(X) at the last checkpoint. The straight line below which
most type-(ii) and all type-(iii) graphs are found goes through
the origin with slope 0.1.

Figure 5 also allows us to investigate, for each graph at
the last checkpoint, how its information gain G(X) and
total correlation C(X) are related to each other. The
simplest case is that of type-(iii) graphs, whose topology
and inhibitory-node positions are fixed at all times.
In this case, the stochasticity of initial node potentials
and synaptic weights, as well as of the functioning of
algorithm A, is insufficient to yield any significant
variation in G(X) or C(X) values. Next are the type-(ii)
graphs, for which neither topology nor the placement of
inhibitory nodes is the same for all graphs. What we
see as a result is significantly more variation in G(X)
and C(X) values, but with very few exceptions all 50
graphs are still discernibly clustered relative to one
another. Type-(i) graphs, finally, with their dependence of
topology upon both a power-law-distributed out-degree
and the random placement of nodes on a sphere, display
G(X) and C(X) values that are spread over a significantly
larger domain.

Aside from such broad qualitative statements, it seems
hard to discriminate among the three graph types by
examining either G(X) or C(X) values alone, even though
there are type-(i) graphs for which G(X) is greater than
for any graph of the other two types, the same holding
for C(X). One simple, though effective, alternative is to
resort to the ratio r(X) defined in Eq. ([S]). This ratio 
gives the fraction of all the information generated by the 
system that corresponds to total correlation, that is, the 
fraction that corresponds to information that depends 
on integration among the variables. Once we adopt this 
metric, then the meaning of Fig. 5 becomes clearer:
although all three types of graph are capable of providing
significant information gain and total correlation, only
type-(i) graphs (those at the basis of our cortical model)
seem capable of providing an abundance of instances for
which r(X) is higher than for most type-(ii) graphs and
all type-(iii) graphs.

By its very definition, the ratio r(X) for a given graph
can be regarded as an indicator of how efficient that
graph is, under algorithm A, at integrating information.
Graphs for which r(X) is higher than for others do a
better job in the sense that, of all the information that
they generate, a higher fraction corresponds to information
that emerges out of the integration among their
constituents. What our results indicate is that type-(i)
graphs, based as they are on a random-graph model, can
be instantiated to specific graphs that are often more
efficient than those of the other two types. The straight
line drawn across Fig. 5 has slope 0.1 and can be used as
an example discriminator on the 150 graphs represented
in the figure with respect to efficiency. Specifically, all
graphs above it are such that r(X) > 0.1. The overwhelming
majority of them are type-(i) graphs.
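
How G(X), C(X), and r(X) follow from an estimated distribution can be illustrated with a short sketch. It rests on assumptions that this section does not spell out: information gain is taken as the entropy reduction relative to the uniform distribution, G(X) = N - H(X) in bits, total correlation as C(X) = sum_i H(X_i) - H(X), and the ratio as r(X) = C(X)/G(X). The function names and toy distributions are ours.

```python
from math import log2


def entropy(probs):
    """Shannon entropy, in bits, of a collection of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)


def gain_correlation_ratio(P, n_vars):
    """Return (G, C, r) for a joint distribution P over {0, 1}^n_vars.

    P maps 0/1 tuples to probabilities. Assumed definitions:
    G = n_vars - H(X), C = sum_i H(X_i) - H(X), r = C / G.
    """
    joint_h = entropy(P.values())
    marginal_h = 0.0
    for i in range(n_vars):
        p1 = sum(p for x, p in P.items() if x[i] == 1)
        marginal_h += entropy((p1, 1.0 - p1))
    gain = n_vars - joint_h
    corr = marginal_h - joint_h
    ratio = corr / gain if gain > 0 else 0.0
    return gain, corr, ratio


# Two perfectly correlated bits: all generated information is integrated.
G1, C1, r1 = gain_correlation_ratio({(0, 0): 0.5, (1, 1): 0.5}, 2)
# Two bits that are always 1: information is generated, none is integrated.
G2, C2, r2 = gain_correlation_ratio({(1, 1): 1.0}, 2)
```

Because r(X) is a ratio, it is unaffected by the normalization to GSCC size, so the slope-0.1 line separates the same graphs with or without that normalization.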

Justifying this behavior in terms of graph structure is
still something of an open problem, though. We believe
the justification has to do with the existence of hubs in
type-(i) graphs, since one of their effects is to shorten
distances, but whatever it is remains to be made precise.
One might also think that the way in- and out-degrees get
mixed in type-(i) graphs could constitute a line of
explanation, especially because these graphs have markedly
different in- and out-degree distributions. If this were the
case then it might be reflected in the statistical properties
of these graphs' assortativity coefficient [50], which is
the Pearson correlation coefficient of the edges' remaining
out-degrees on the tail sides and remaining in-degrees on
the head sides. Recall, however, that generating type-(i)
graph instances, just like generating type-(ii) instances,
makes no reference whatsoever to node degrees when deciding
which nodes are to be joined by a given edge, so
the expected assortativity coefficient of graphs of either
type is zero in the limit of a formally infinite number
of nodes, quite unlike the fixed structure of a type-(iii)
graph (for which the assortativity coefficient is 1).
We have verified that this holds by resorting to 1 000 in- 
dependent instances of type-(i) graphs for n = 100 and 
keeping the calculations inside each graph's GSCC. In 
this experiment we also found that the standard devia- 
tion of the assortativity coefficient is of the order of 10^^, 
so there is little variation from graph to graph. 
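
The assortativity coefficient as described above can be computed as in the following sketch. The function name and toy edge lists are hypothetical, and the restriction to the GSCC is assumed to have been done beforehand. When all remaining degrees coincide, as in a regular graph, the Pearson formula is 0/0; the sketch returns None in that case rather than adopting a convention.

```python
from math import sqrt


def directed_assortativity(edges):
    """Pearson correlation, over edges (u, v), between the remaining
    out-degree of the tail u and the remaining in-degree of the head v."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    xs = [out_deg[u] - 1 for u, _ in edges]   # remaining out-degree at the tail
    ys = [in_deg[v] - 1 for _, v in edges]    # remaining in-degree at the head
    m = len(edges)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # degenerate case: the correlation is undefined
    return cov / sqrt(var_x * var_y)


uncorrelated = directed_assortativity([(0, 1), (0, 2), (3, 1), (4, 5)])
assortative = directed_assortativity([(0, 1), (0, 2), (3, 1), (3, 2), (4, 5)])
```

In the first toy graph high out-degree tails are paired with high and low in-degree heads equally often, giving a coefficient of 0; in the second the pairing is perfect, giving 1.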

FIG. 6: Contour plots of the joint distribution of a node's
in- and out-degree for type-(i) (a) and type-(ii) (b) graphs.
Data are averages over 1 000 graphs of each type for n = 100,
always restricted to each graph's GSCC.

A possible route to analyzing the role of hubs in giving
rise to efficient information integration predominantly in
type-(i) graphs may be to study the joint distributions
of in- and out-degrees. These distributions are shown
in Fig. 6 in the form of contour plots, for type-(i) and
type-(ii) graphs with n = 100. In the figure, all data
are averages over the inside of each graph's GSCC, so
in- and out-degrees are expected to be no larger than 90.
While for type-(ii) graphs, in reference to part (b) of the
figure, we expect no nodes to exist whose in- and out-degrees
differ from each other significantly, the case of
type-(i) graphs is substantially different. First of all, the
data in part (a) of the figure reveal that the most common
combination of in- and out-degree at a node is that
in which the node has a small number of in-neighbors
(between 2 and 4) and an even smaller number (in fact,
no more than 2) of out-neighbors. Such nodes function
somewhat as a type of concentrator, meaning that whenever
they fire in the wake of the accumulation of signaling
from their in-neighbors the resulting signal affects at most
two other nodes. Hubs occur at the opposite end of this
spectrum. If we take a node to be a hub when it has,
say, at least 50% of the other nodes as out-neighbors,
then we see that, though hubs are very rare, when they
occur they function as a type of disseminator: when they
fire, they affect substantially more nodes than the handful
of in-neighbors whose accumulated signals led to the
firing itself. Perhaps it is the combination of these two
types of behavior, viz. an abundance of concentrators and
the occasional occurrence of a disseminator, that explains
the information-integration behavior we have observed
for type-(i) graphs jg^]. We expect that more research 
will clarify whether this is the case. 
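
The degree statistics discussed here are simple to tabulate for a single graph. The sketch below uses names of our own choosing; the 50% hub threshold is the illustrative one given in the text, and the five-node example is hypothetical.

```python
from collections import Counter


def degrees(n, edges):
    """Return the in-degree and out-degree lists of an n-node directed graph."""
    in_deg, out_deg = [0] * n, [0] * n
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


def joint_degree_counts(n, edges):
    """Count how many nodes have each (in-degree, out-degree) combination."""
    in_deg, out_deg = degrees(n, edges)
    return Counter(zip(in_deg, out_deg))


def hubs(n, edges, fraction=0.5):
    """Nodes having at least `fraction` of the other nodes as out-neighbors."""
    _, out_deg = degrees(n, edges)
    return [u for u in range(n) if out_deg[u] >= fraction * (n - 1)]


# Toy graph: node 0 gathers from two in-neighbors and reaches three nodes.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 0), (2, 0)]
toy_counts = joint_degree_counts(5, toy_edges)
toy_hubs = hubs(5, toy_edges)
```

Averaging such counts over many graph instances, and normalizing, would give an empirical joint distribution of the kind that a contour plot can display.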



VII. CONCLUDING REMARKS 

We have introduced a network-algorithmic framework
to study the emergence of information integration in
directed graphs. Our original inspiration was the IIT of [1],
but the resulting framework departs significantly from
IIT in several aspects, most notably the adoption of
the asynchronous distributed algorithm of [5] to simulate
neuronal processing and signaling, and the use of
binary variables, one for each node in the graph's GSCC,
to signify whether nodes are reached by at least one message
during a run of the algorithm. These variables have
been the basis on which two information-theoretic quantities
can be computed, namely information gain G(X)
and total correlation C(X). Given the graph's structure,
the former of these indicates how much information the
system generates, under the distributed algorithm, from
an initial state of total uncertainty. The latter, in turn,
indicates how much of this information is integrated, as
opposed to the information that each node generates
locally, independently of all others, denoted by G_i(X_i) for
node i. For N the number of variables, these quantities
are related by Eq. ((T)). This equation, with hindsight, 
can nowadays be seen to have been present, at least
qualitatively, in most information-theoretic views of system
organization and structure (cf. [51] for an early example).
If we stick with the basic premise of IIT, that consciousness
and some form of information integration are to be
equated, then what the equation says is that, of all the
information that the system generates [G(X)], some
reflects conscious processing [C(X)] and some unconscious
processing [Σ_i G_i(X_i)].
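
The decomposition can be written out explicitly under assumed definitions that are consistent with the description above but are not restated in this section: with H denoting Shannon entropy in bits, information gain is measured from the uniform distribution, so G(X) = N - H(X) and G_i(X_i) = 1 - H(X_i), and total correlation takes its standard form.

```latex
\begin{align}
G(X) &= N - H(X), \\
G_i(X_i) &= 1 - H(X_i), \\
C(X) &= \sum_{i=1}^{N} H(X_i) - H(X), \\
C(X) + \sum_{i=1}^{N} G_i(X_i)
  &= \sum_{i=1}^{N} H(X_i) - H(X) + N - \sum_{i=1}^{N} H(X_i)
   = N - H(X) = G(X).
\end{align}
```

Under these assumptions the marginal entropies cancel, which is why G(X) splits exactly into an integrated part and a sum of local parts.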

We have studied the behavior of information gain and
total correlation for a variety of graphs. These have
included (i) the random graphs that, as we know from
our earlier study in [5], reproduce some experimentally
observed cortical properties; (ii) random graphs with
Poisson-distributed in- and out-degrees; and (iii) the
deterministically structured circulant graphs. While we
have found that many instances of these graphs are capable
of generating comparable amounts of both information
gain and total correlation, those that do so efficiently
[i.e., with a comparatively high C(X)/G(X) ratio] are
very predominantly of type (i). In the context of regarding
information integration as consciousness, this seems
to provide further evidence that the cortical model
introduced in [5] can indeed be useful as a framework for
the study of cortical dynamics. Another interesting aspect,
now related to the actual G(X) and C(X) values
we have observed, is that the latter are much lower than
the former. Once again, though, the association of
consciousness with integrated information seems illuminating,
since the overwhelming majority of all processing in
the brain is believed to occur unconsciously [53].

As it stands, our framework is only capable of handling
relatively small graphs. The main difficulty is that
we need to organize statistics of the frequency of occurrence
of the various members of {0, 1}^N that appear in
the runs as they elapse, and this requires huge amounts of
input/output operations on external storage. With current
technology, the results presented in Sec. V can require
up to three weeks to complete for each graph. And
while the potential for parallelism is very great, normally
one is also limited in the number of processors one can
count on. An important direction in which to continue
this research is to address these computational limitations.
Success here will immediately facilitate important
further research on crucial aspects of our conclusions,
for example those related to how our findings scale with
increasing system size. We also find our underlying cortical
model, with its structural and algorithmic components,
to be suitable for the undertaking of investigations on
entirely different fronts. One possibility of interest to us is
how to look for, and characterize, the emergence of certain
oscillatory patterns of cortical activity [53]. It seems
that the central issue is how to reconcile such oscillations
with the inherently asynchronous character of our model.
This, however, remains open to further research.